This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
library(Seurat)
library(ggplot2)
library(scCustomize)
library(readr)
library(pheatmap)
library(matrixStats)
library(spdep)
library(geojsonR)
Sets the path to the directory containing the output data - this is the directory where all of the outputs are stored.
data_xenium <- "/project/shared/spatial_data_camp/datasets/DATASET2/XENIUM_COLORECTAL_CANCER"
data_merscope <- "/project/shared/spatial_data_camp/datasets/DATASET2/MERSCOPE_COLORECTAL_CANCER"
xenium_CRC <- ReadXenium(data_xenium, outs = c("matrix", "microns"), type=c("centroids", "segmentations"))
10X data contains more than one type and is being returned as a list containing matrices of each type.
|--------------------------------------------------|
|==================================================|
?ReadXenium
starting httpd help server ... done
names(xenium_CRC)
[1] "matrix" "microns" "centroids" "segmentations"
Read in additional information about the cells - this gives us pre-calculated information, for example segmented cell or nucleus size for each cell.
cell_meta_data <- read.csv(file.path(data_xenium, "cells.csv.gz"))
rownames(cell_meta_data) <- cell_meta_data$cell_id
head(cell_meta_data)
We will start by creating a basic seurat object from the data.
CreateSeuratObject function initializes a Seurat object using the provided gene expression matrix and optional metadata.
counts: The gene expression matrix, which contains the raw count data for each gene in each cell. data$matrix[[“Gene Expression”]]: Specifies the gene expression matrix extracted from the loaded Xenium data. Here, we leave out the control probes for now.
assay: The name of the assay - you can call it anything you like. Here, we go with “XENIUM”.
meta.data: Metadata associated with the cells or spots. Here, we add the cell statistics we read in earlier as cell_meta_data.
By printing the seurat object, we can see that we read in ~ 30,000 cells with measures for 325 genes
When creating the Seurat object only on matrix can be created. Additional matrices would need to be added in separate steps later.
seurat <- CreateSeuratObject(counts = xenium_CRC$matrix[["Gene Expression"]],
assay = "XENIUM",
meta.data = cell_meta_data)
seurat
An object of class Seurat
325 features across 647524 samples within 1 assay
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
Adding spatial coordinates to a Seurat object allows for spatially resolved analysis and visualization. This requires creating objects for centroids and segmentations we read in earlier, and then integrating these with the main Seurat object.
CreateFOV: This function creates a field of view (FOV) object that includes spatial information about the centroids, segmentations, and molecule coordinates. An FOV can be the entire slide, or a selected region within a slide - i.e. it does not need to have entries for all the cells in the seurat object.
coords: A list containing the centroids and/or segmentation data. For larger datasets, it can be quicker to only load centroids, as this minimises the amount of data points.
centroids = CreateCentroids(data\(centroids)*: Creates a centroids object from the centroid data in the Xenium dataset. *segmentation = CreateSegmentation(data\)segmentations): Creates a segmentation object from the segmentation data in the Xenium dataset.
type = c(“segmentation”, “centroids”): Specifies the types of spatial data being included, which are segmentation and centroid data.
molecules = data$microns: The spatial coordinates of individual transcripts/molecules in the data. This is optional - for larger datasets, skipping transcript coordinates can be a good idea. In this case the transcripts will just be assigned by cell.
seurat[[“COLON”]] <- coords: Adds the created FOV object to the Seurat object under the new FOV name “COLON”. This can be named (almost) anything - but, avoid using underscores as this can create some unexpected behaviours later.
TIP: LoadXenium() is a wrapper that would load in both cell counts matrix and spatial coordinates in one function, simplifying these steps. However, in situ platforms are evolving at a very fast rate and there are constant changes on how the data is stored, in particular for file formats for cell segmentation and coordinates. Here, we have broken down the steps to show how to assemble an in situ seurat object from the key components, in case the platform specific readers don’t work for your specific data.
coords <- CreateFOV(coords = list(centroids = CreateCentroids(xenium_CRC$centroids),
segmentation = CreateSegmentation(xenium_CRC$segmentations)),
type = c("segmentation", "centroids"),
molecules = xenium_CRC$microns,
assay = "XENIUM")
seurat[["COLON"]] <- coords
Inspect the object - now, you can see we have added a spatial field of view:
seurat
An object of class Seurat
325 features across 647524 samples within 1 assay
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
1 spatial field of view present: COLON
ImageFeaturePlot(seurat, "nCount_XENIUM", axes = T) + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Create a x- and a y-filter based on the defined area of interest (based on image in Xenium Analyzer)
seurat$x_FILTER <- seurat$x_centroid <= 6000 & seurat$x_centroid >= 1000
seurat$y_FILTER <- seurat$y_centroid <= 4000 & seurat$y_centroid >= 2000
Finally, we can subset the seurat object based on chosen x & y area based on the previously created filters.
seurat <- subset(seurat, x_FILTER & y_FILTER)
Warning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating Seurat objects
ImageFeaturePlot(seurat, "nCount_XENIUM", axes = T) + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Adding control probes and codewords as separate assays in the Seurat object allows for the tracking and analysis of technical artifacts and noise within your spatial transcriptomics data, while keeping these outputs separate from the main biological gene expression values.
Unassigned codewords are unused codewords. There is no probe in a particular gene panel that will generate the codeword.
Negative control probes are probes that exist in the panels but target non-biological sequences. They can be used to assess the specificity of the assay.
Negative control codewords are codewords in the codebook that do not have any probes matching that code. They are chosen to meet the same requirements as regular codewords and can be used to assess the specificity of the decoding algorithm.
[[]] and $ can be mostly used interchangably. Be careful with using $ in seurat objects. Because I’m creating an assay both ([[]] and $) will know what to do with it.
seurat[["Negative.Control.Codeword"]] <- CreateAssayObject(counts = xenium_CRC$matrix[["Negative Control Codeword"]][, colnames(seurat)])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
seurat$Negative.Control.Probe <- CreateAssayObject(counts = xenium_CRC$matrix[["Negative Control Probe"]][, colnames(seurat)])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
seurat$Unassigned.Codeword <- CreateAssayObject(counts = xenium_CRC$matrix$`Unassigned Codeword` [, colnames(seurat)])
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Equivalent to the above chunk but packed in a for loop. Takes all matrices from the xenium_CRC$matrix list except gene expression as this has been added to Seurat already. At this point with just 3 new assay objects needed, writing multiple lines of code is more sensible than packing it into a comparably complicated for loop.
for (i in setdiff(names(xenium_CRC$matrix), "Gene Expression")) {
seurat[[gsub(pattern = " ", replacement = ".", x = i)]] <- CreateAssayObject(counts = xenium_CRC$matrix[[i]][, colnames(seurat)])
}
Compare transcript and gene counts per cell on the created subset.
p_transcript <- ImageFeaturePlot(seurat, "nCount_XENIUM") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
p_gene <- ImageFeaturePlot(seurat, "nFeature_XENIUM") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
p_transcript + p_gene
Examine the distribution of the number of features (genes) detected per cell in the Seurat object using a density plot and calculates specific quantiles of this distribution.
quantile(seurat$nFeature_XENIUM, c(0.01, 0.1, 0.5, 0.9, 0.99))
1% 10% 50% 90% 99%
7 16 34 57 76
ggplot(seurat[[]], aes(nFeature_XENIUM)) +
geom_density() +
geom_vline(xintercept = quantile(seurat$nFeature_XENIUM, c(0.01, 0.1, 0.5, 0.9, 0.99)), lty = 2)
Plot the cell area
p_area <- ImageFeaturePlot(seurat, "cell_area") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
p_gene + p_area
ImageFeaturePlot(seurat, "nucleus_area") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
The cell-to-nucleus area ratio can also potentially provide insights into cell morphology, cell type and potential changes in cellular states or conditions. For example, T-Cells can often be quite well identified by this variable alone, as they have a small cytoplasm volume. However, without a cell boundary stain, this metric mainly captures segmentation artefacts, so be careful about over-interpretation!
seurat$nucleus_cell_ratio <- seurat$nucleus_area / seurat$cell_area
ImageFeaturePlot(seurat, "nucleus_cell_ratio") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Check cell size distribution as well as transcript density i.e. transcript count over cell area.
ggplot(seurat[[]], aes(cell_area)) + geom_density() + geom_vline(xintercept = quantile(seurat$cell_area, c(0.01, 0.9, 0.99)), lty = c(2,3,4))
ggplot(seurat[[]], aes(nCount_XENIUM, cell_area)) + geom_point()
Check where big cells with low counts are located to identify poorly covered areas (most likely due to gene inclusion in the panel).
ImageFeaturePlot(seurat, "area_count_ratio") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Create a filter for overly large cells in the sample and plot which cells would be removed if applying it.
Do the same thing for very small cells.
We can check how these values correlate with gene detection rate.
If we filter out small cells, we will remove cells with low numbers of genes detected.
If we filter out large cells, this is not that biased towards overly large counts, as we saw before.
p1 <- VlnPlot(seurat, "nFeature_XENIUM", group.by = "SIZE_FILTER_SMALL", pt.size = .05, alpha = .3) + labs(title="Small Cell Filter")
Warning: Default search for "data" layer in "XENIUM" assay yielded no results; utilizing "counts" layer instead.Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
p2 <- VlnPlot(seurat, "nFeature_XENIUM", group.by = "SIZE_FILTER_LARGE", pt.size = .05, alpha = .3)+ labs(title="Large Cell Filter")
Warning: Default search for "data" layer in "XENIUM" assay yielded no results; utilizing "counts" layer instead.Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
p1 + p2
The most important filter is the overall transcript detection. Empty cells or cells with very low transcript count cannot be taken forward for clustering analysis and it is extremely difficult to identify what they may be. Here, we set a threshold of minimum 15 transcripts.
seurat$TRANSCRIPT_FILTER <- seurat$nCount_XENIUM >= 15
ImageDimPlot(seurat, group.by = "TRANSCRIPT_FILTER")
Finally, visualizing the counts of negative control codewords, negative control probes, and unassigned codewords helps identify and understand technical artifacts and background noise in your spatial transcriptomics data.
Here, we can see that all control probes and codewords produce yield very little signal, suggesting our data is good quality!
In some cases, high amount of autoflourescence is the cells/tissue can sometimes generate false positive signal and this should be filtered out.
ImageDimPlot(seurat, group.by = "nCount_Negative.Control.Codeword")
ImageDimPlot(seurat, group.by = "nCount_Negative.Control.Probe")
ImageDimPlot(seurat, group.by = "nCount_Unassigned.Codeword")
Although the negative control signal is low, we can nonetheless create a filter to remove cells which have any, although in this case it is probably unnecessary.
seurat$PROBE_FILTER <- seurat$nCount_Unassigned.Codeword == 0 &
seurat$nCount_Negative.Control.Codeword == 0 &
seurat$nCount_Negative.Control.Probe == 0
ImageDimPlot(seurat, group.by = "PROBE_FILTER")
Finally, we can subset the seurat object based on any/all of the filters we have created earlier.
By combining probe, size, and transcript filters, you can retain only the cells that meet all quality criteria, reducing the impact of technical artifacts and noise on your analysis.
seurat_clean <- subset(seurat, PROBE_FILTER & SIZE_FILTER_LARGE & SIZE_FILTER_SMALL & TRANSCRIPT_FILTER)
Warning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating Centroids objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating FOV objectsWarning: Not validating Seurat objects
seurat
An object of class Seurat
541 features across 128841 samples within 4 assays
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
3 other assays present: Negative.Control.Codeword, Negative.Control.Probe, Unassigned.Codeword
1 spatial field of view present: COLON
seurat_clean
An object of class Seurat
541 features across 122015 samples within 4 assays
Active assay: XENIUM (325 features, 0 variable features)
1 layer present: counts
3 other assays present: Negative.Control.Codeword, Negative.Control.Probe, Unassigned.Codeword
1 spatial field of view present: COLON
Data Normalisation
The SCTransform function in Seurat is used for normalizing single-cell RNA-seq and spatial transcriptomics data. This method models the gene expression counts using a regularized negative binomial regression and removes technical noise while preserving biological variability. The clip.range parameter is used to limit the range of the transformed values, which can help stabilize downstream analyses by limiting the influence of extreme values.
seurat_clean <- SCTransform(seurat_clean, assay = "XENIUM", clip.range = c(-10, 10))
Running SCTransform on assay: XENIUM
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 325 by 122015
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 322 genes, 5000 cells
Found 43 outliers - those will be ignored in fitting/regularization step
Second step: Get residuals using fitted parameters for 325 genes
Computing corrected count matrix for 325 genes
Calculating gene attributes
Wall clock passed: Time difference of 23.03947 secs
Determine variable features
Centering data matrix
|
| | 0%
|
|========================================================================================================| 100%
Getting residuals for block 1(of 25) for counts dataset
Getting residuals for block 2(of 25) for counts dataset
Getting residuals for block 3(of 25) for counts dataset
Getting residuals for block 4(of 25) for counts dataset
Getting residuals for block 5(of 25) for counts dataset
Getting residuals for block 6(of 25) for counts dataset
Getting residuals for block 7(of 25) for counts dataset
Getting residuals for block 8(of 25) for counts dataset
Getting residuals for block 9(of 25) for counts dataset
Getting residuals for block 10(of 25) for counts dataset
Getting residuals for block 11(of 25) for counts dataset
Getting residuals for block 12(of 25) for counts dataset
Getting residuals for block 13(of 25) for counts dataset
Getting residuals for block 14(of 25) for counts dataset
Getting residuals for block 15(of 25) for counts dataset
Getting residuals for block 16(of 25) for counts dataset
Getting residuals for block 17(of 25) for counts dataset
Getting residuals for block 18(of 25) for counts dataset
Getting residuals for block 19(of 25) for counts dataset
Getting residuals for block 20(of 25) for counts dataset
Getting residuals for block 21(of 25) for counts dataset
Getting residuals for block 22(of 25) for counts dataset
Getting residuals for block 23(of 25) for counts dataset
Getting residuals for block 24(of 25) for counts dataset
Getting residuals for block 25(of 25) for counts dataset
Centering data matrix
|
| | 0%
|
|========================================================================================================| 100%
Finished calculating residuals for counts
Set default assay to SCT
Run PCA on the cleaned up and normalised data.
seurat_clean <- RunPCA(seurat_clean)
PC_ 1
Positive: CD24, SLC12A2, RRM2, PPP1R1B, TYMS, HMGB2, SOX9, CDCA7, CA2, FERMT1
CEACAM5, EPHB3, STMN1, C1QBP, AQP1, CMBL, SMOC2, PCLAF, MKI67, TK1
LGR5, REG4, EGFR, IMPDH2, DMBT1, S100P, MLPH, CREB3L1, GATA2, UBE2C
Negative: IGFBP7, THBS1, TIMP3, DPYSL3, CXCR4, VCAN, MAF, IFITM1, CD79A, CYBB
ETS1, CLU, TRAC, IL7R, TRBC2, MS4A1, ANXA1, CD3E, RPS4Y1, CD2
CTSB, DEPP1, CCL5, FZD7, PLXND1, SPOCK2, SERPINA1, SEC11C, SOCS3, RUNX1T1
PC_ 2
Positive: THBS1, IGFBP7, DPYSL3, TIMP3, VCAN, ALDH1B1, FZD7, AQP1, DEPP1, RUNX1T1
CDKN2B, CLU, CES1, CD24, RRM2, SELENOM, CKAP4, CPE, TUBA1A, FRZB
PRDX4, TYMS, MEIS2, FKBP11, EPHB3, SLC12A2, SEC11C, HMGB2, EGFR, IFITM1
Negative: APOE, CYBB, CTSB, RNASE1, SERPINA1, CD14, MS4A7, C1QC, CD163, C1QA
IL7R, CXCR4, C1QB, CCL4, FYB1, CCL5, TRBC2, CD2, CD8A, CD3E
MS4A1, TRAC, GPR183, CD83, GZMA, MAF, TNFSF13B, CTLA4, CD3D, GZMK
PC_ 3
Positive: MS4A1, CXCR4, TRBC2, TRAC, CD3E, CD79A, CD2, GZMK, CD8A, SPOCK2
IL7R, CTLA4, BANK1, ETS1, CCL5, TIGIT, CD3D, CD3G, KLRB1, GZMA
CD6, SPIB, CLU, LTB, LRMP, CST7, CD5, TRAT1, CCR7, TRBC1
Negative: CTSB, RNASE1, THBS1, APOE, CYBB, IGFBP7, SERPINA1, CD14, MS4A7, C1QC
TIMP3, CD163, C1QA, C1QB, PLXND1, VCAN, DPYSL3, FZD7, ALDH1B1, ANXA1
SOCS3, CES1, CPE, TNFSF13B, RUNX1T1, TUBA1A, IL1B, IL4I1, SELENOM, CDKN2B
PC_ 4
Positive: CD79A, MS4A1, SEC11C, CLU, BANK1, CXCR4, FKBP11, TNFRSF17, LRMP, DERL3
SPIB, PRDX4, RGS13, CD79B, FCRL1, SELENOK, SMIM14, IRF8, CYBB, PAX5
TCL1A, CD83, SELL, C2orf88, MS4A7, GPR183, FCER2, APOE, CD14, ARHGAP24
Negative: CD8A, CCL5, GZMA, CD2, CD3E, TRBC2, TRAC, CTLA4, CD3G, CD3D
NKG7, TIGIT, KLRB1, CCL4, GZMK, SPOCK2, THBS1, TIMP3, CD6, CST7
IL7R, IGFBP7, CD8B, GNLY, ETS1, CD5, FOXP3, TRBC1, ITK, MAF
PC_ 5
Positive: MS4A1, CLU, BANK1, CXCR4, SPIB, THBS1, DPYSL3, ALDH1B1, SMIM14, FCRL1
IRF8, CD83, TCL1A, CES1, PAX5, FZD7, SELL, DEPP1, LTB, ARHGAP24
ETS1, CCR7, MAOB, AQP1, CXCR5, CYBB, FCER2, CD24, PLXND1, ROBO1
Negative: SEC11C, CD79A, FKBP11, PRDX4, TNFRSF17, DERL3, SELENOK, CCL5, CD8A, GZMA
CKAP4, C2orf88, IFITM1, NKG7, RGS13, CCL4, GNLY, CD3E, VCAN, CDKN2B
CD2, FRZB, CD3G, CST7, KLRD1, CD8B, KLRC2, GZMK, CD3D, KLRB1
Create an elbow plot to decide how many PCs to take into account for UMAP creation.
Plot the top genes contributing to a specific principal component to get an idea of the underlying biological driver of the variation captured by that component. This type of plot highlights the genes with the highest loadings, which are the most influential in the principal component analysis.
PC_Plotting(seurat_clean, dim_number = 1)
PC_Plotting(seurat_clean, dim_number = 2)
PC_Plotting(seurat_clean, dim_number = 3)
The FeaturePlot function in Seurat is used to visualize the expression of a specific gene across cells in a given dimensionality reduction space (e.g., PCA). This helps to understand how the expression of a gene varies across the principal components.
FeaturePlot(seurat_clean, "SOX9", reduction = "pca") + scale_color_viridis_c()
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
We can also examine how various PCs are distributed spatially.
Here, we can see that high PC1 loadings enrich in follicular structures and low PC1 loadings enrich in crypt top cells.
ImageFeaturePlot(seurat_clean, "PC_1") + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
We can plot the expression of high (or low) loading genes to visualise how this correlates with our dimensionality reduction.
ImageFeaturePlot(seurat_clean, "SOX9", size=.5) + scale_fill_viridis_c()
Warning: No FOV associated with assay 'SCT', using global default FOVScale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Run UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction. dims: Specifies the principal components to use for UMAP.
FindNeighbors: Finding nearest neighbors helps to identify cells that are similar based on their PCA scores, which is used for clustering. reduction = “pca”: Specifies that the PCA space should be used for finding neighbors. dims: Specifies the principal components to use for identifying neighbors.
FindClusters: Clustering identifies distinct groups of cells with similar gene expression patterns. The resolution parameter controls the granularity of the clustering. resolution: Sets the resolution parameter for clustering. Higher values lead to more clusters, while lower values lead to fewer clusters.
Then visualise the clusters - firstly, based on transcriptome embedding.
After checking several resolution and comparing the DimPlots 0.7 was chosen as a sensible resolution.
Previous code: for (i in seq(0.5, 1, by = 0.25)) { assign(paste(“seurat_clean”, i, sep = “_“), FindClusters(seurat_clean, resolution = i)) } DimPlot(seurat_clean_0.5, label=T, repel=T) DimPlot(seurat_clean_0.75, label=T, repel=T) DimPlot(seurat_clean_1, label=T, repel=T)
seurat_clean <- FindClusters(seurat_clean, resolution = 0.7)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 122015
Number of edges: 3983224
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8649
Number of communities: 19
Elapsed time: 90 seconds
1 singletons identified. 18 final clusters.
seurat_clean <- FindClusters(seurat_clean, resolution = 0.7)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 122015
Number of edges: 3983224
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8649
Number of communities: 19
Elapsed time: 90 seconds
1 singletons identified. 18 final clusters.
DimPlot(seurat_clean, label=T, repel=T)
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
Plot the clusters in tissue space.
ImageDimPlot(seurat_clean, size=.5)
Warning: No FOV associated with assay 'SCT', using global default FOV
Use FindMarkers to identify marker genes for specific cell clusters.
FindMarkers: Identifies genes that are differentially expressed in a specified cluster compared to all other cells. ident.1 = “0”: Specifies the cluster of interest for which marker genes are to be identified. In this case, cluster “0”. max.cells.per.ident = 500: Limits the number of cells to be used from each cluster for the differential expression analysis to 500. This can help to speed up the computation.
markers_all <- FindAllMarkers(seurat_clean, max.cells.per.ident = 500)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16
Calculating cluster 17
Get the top 5 markers of the clusters.
head(markers)
head(markers_all)
top <- Extract_Top_Markers(markers_all, num_genes = 5, named_vector = F, make_unique = T, data_frame = T)
Visualise expression of cluster specific markers using feature plots
FeaturePlot(seurat_clean, "CTLA4", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
FeaturePlot(seurat_clean, "LGR5", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
FeaturePlot(seurat_clean, "CD14", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
FeaturePlot(seurat_clean, "MS4A1", label=T, repel=T)+ scale_color_viridis_c(direction=-1)
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.
Display the expression patterns of the top marker genes across clusters using Clustered_DotPlot function from the scCustomize package.
k: Determines the number of clusters for the hierarchical clustering of genes to enhance visual separation of expression patterns.
Most clusters have unique markers, which suggests the dataset is not over-clustered.
Clustered_DotPlot(seurat_clean, features = top, k=18)
[[1]]
[[2]]
Create a vector containing all genes present in the panel for further plotting and comparison.
genes_all <- rownames(seurat_clean)
genes_all
[1] "ABCC8" "ADH1C" "ADRA2A" "AFAP1L2" "AGTR1" "AKR1C3" "AKR7A3" "ALDH1B1" "ANK2"
[10] "ANO7" "ANXA1" "ANXA13" "APOE" "AQP1" "AQP8" "AREG" "ARHGAP24" "ARX"
[19] "ASCL2" "ATOH1" "AVIL" "AZGP1" "B3GNT6" "BANK1" "BATF" "BCAS1" "BEST2"
[28] "BEST4" "BMX" "BRCA2" "C1QA" "C1QB" "C1QBP" "C1QC" "C2orf88" "CA1"
[37] "CA2" "CA4" "CA7" "CADPS" "CALB2" "CCL20" "CCL4" "CCL5" "CCR7"
[46] "CD14" "CD163" "CD177" "CD2" "CD24" "CD3D" "CD3E" "CD3G" "CD40LG"
[55] "CD5" "CD6" "CD7" "CD79A" "CD79B" "CD83" "CD8A" "CD8B" "CDCA7"
[64] "CDHR5" "CDK15" "CDK6" "CDKN2B" "CEACAM1" "CEACAM5" "CEACAM6" "CEACAM7" "CEP126"
[73] "CES1" "CES2" "CFTR" "CHGA" "CHGB" "CHI3L2" "CHP2" "CKAP4" "CLCA1"
[82] "CLCA4" "CLEC9A" "CLU" "CMA1" "CMBL" "COL17A1" "COL19A1" "CPA3" "CPE"
[91] "CREB3L1" "CRYBA2" "CST7" "CTLA4" "CTSB" "CTSE" "CTSG" "CXCL3" "CXCR4"
[100] "CXCR5" "CYBB" "DEPP1" "DERL3" "DMBT1" "DNAJC12" "DNASE1L3" "DPYSL3" "DUOX2"
[109] "EBPL" "EGFR" "EPHB3" "ETS1" "ETV1" "FABP2" "FCER2" "FCN1" "FCRL1"
[118] "FCRLA" "FERMT1" "FEV" "FFAR4" "FKBP11" "FOXA3" "FOXP3" "FRZB" "FYB1"
[127] "FZD7" "GALNT5" "GALNT8" "GATA2" "GCG" "GIMAP7" "GNA11" "GNLY" "GPR183"
[136] "GPRC5C" "GPRIN3" "GUCA2A" "GUCA2B" "GZMA" "GZMK" "HDC" "HEPACAM2" "HES6"
[145] "HHLA2" "HMGB2" "HPGDS" "HRCT1" "HTR3E" "ICOS" "ID2" "IER5" "IFITM1"
[154] "IGFBP7" "IL17RB" "IL1B" "IL1RAPL1" "IL1RL1" "IL22" "IL32" "IL4I1" "IL7R"
[163] "IMPDH2" "INSL5" "INSM1" "IRF8" "ISL1" "ITK" "ITLN1" "KIF5C" "KIT"
[172] "KLK1" "KLRB1" "KLRC2" "KLRD1" "KRT1" "KRT86" "KRTCAP3" "L1TD1" "LCN2"
[181] "LEF1" "LEFTY1" "LEPROTL1" "LGALS2" "LGR5" "LILRA4" "LRMP" "LTB" "LYVE1"
[190] "MAF" "MAOB" "MB" "MCEMP1" "MEIS2" "MKI67" "MLPH" "MS4A1" "MS4A12"
[199] "MS4A2" "MS4A7" "MS4A8" "MSLN" "MT1A" "MUC12" "MYH14" "NEUROD1" "NKG7"
[208] "NKX2-2" "NOSIP" "NOVA1" "NXPE4" "ODF2L" "OLFM4" "OTOP2" "PAX4" "PAX5"
[217] "PBK" "PCLAF" "PDE4C" "PDZK1IP1" "PI3" "PLCE1" "PLPP2" "PLXND1" "PPP1R1B"
[226] "PRDX4" "PROX1" "PRPH" "PSTPIP2" "PTGER4" "PTTG1" "RAB26" "RAB3B" "RAP1GAP"
[235] "RARRES2" "RBP4" "REG4" "REP15" "RETNLB" "RFX6" "RGMB" "RGS13" "RHOV"
[244] "RIIAD1" "RNASE1" "RNF43" "ROBO1" "ROBO2" "RORA" "RORC" "RPS4Y1" "RRM2"
[253] "RUNX1T1" "S100A12" "S100P" "SCG2" "SCG3" "SCG5" "SCGN" "SCNN1A" "SDCBP2"
[262] "SEC11C" "SELENBP1" "SELENOK" "SELENOM" "SELL" "SERPINA1" "SFXN1" "SH2D6" "SH2D7"
[271] "SIT1" "SLC12A2" "SLC18A2" "SLC26A2" "SLC26A3" "SLC29A4" "SLC6A8" "SLPI" "SMIM14"
[280] "SMIM6" "SMOC2" "SOCS3" "SOX9" "SPDEF" "SPIB" "SPOCK2" "STMN1" "STXBP6"
[289] "SULT1B1" "SVOPL" "TBC1D4" "TCL1A" "TFF1" "THBS1" "TIGIT" "TIMP3" "TK1"
[298] "TKT" "TMEM61" "TMIGD1" "TNFAIP3" "TNFRSF17" "TNFRSF25" "TNFSF13B" "TPSG1" "TRAC"
[307] "TRAT1" "TRBC1" "TRBC2" "TRDV1" "TRGV4" "TRPM5" "TTR" "TUBA1A" "TUBB"
[316] "TYMS" "UBE2C" "UCN3" "UGT2A3" "UGT2B17" "VCAN" "VPREB3" "VWA5B2" "WFDC2"
[325] "XCL2"
Additional Spatial Visualisations
For better visualisation of spatial distribution of clusters, subset only certain groups to reduce crowding.
WhichCells: Identifies cells based on specified criteria. seurat: The Seurat object. expression = seurat_clusters %in% c(0, 5): Logical expression to select cells belonging to clusters 0 and 5.
This works with ImageFeaturePlot too. Try it with some genes!
Choose an area to examine in more detail, if you want.
ImageDimPlot(seurat_clean, axes = T)
Warning: No FOV associated with assay 'SCT', using global default FOV
Create a new FOV with coordinates of interest by using the Crop function.
seurat[[“COLON”]]: The spatial assay to be cropped. x: The x-axis range for the crop. y: The y-axis range for the crop. coords: Specifies the coordinate system to use (typically “plot” for spatial coordinates) plot uses coordinates of a ‘normal plot’ with x as the horizontal axis and y as the vertical axis tissue uses the image coordinates as displayed.
seurat[[“ROI1”]] <- cropped: Adds the cropped region as a new FOV named “ROI1” in the Seurat object. This could be a more informative name, but avoid using underscores!
options(future.globals.maxSize = Inf) -> sets the maximum memory limit to infinite
cropped <- Crop(seurat_clean[["COLON"]], x = c(1600, 1900), y = c(2700, 3400), coords = "tissue")
Warning message:
In png(..., res = dpi, units = "in") :
unable to open connection to X11 display ''
cropped <- Crop(seurat_clean[["COLON"]], x = c(1600, 1900), y = c(2700, 3400), coords = "tissue")
seurat_clean[["ROI1"]] <- cropped
Warning: Key ‘XENIUM_’ taken, using ‘roi1_’ instead
Now we can limit our visualisations just to this region by specifying the name of the new FOV as an “fov” arguement.
As we are zooming in closer to the tissue, we can also switch from plotting cell centroids (i.e. dots) by default to visualising cell segmentation boundaries. Plotting cell boundary polygons for large FOVs can be quite time consuming, and doesn’t provide much more detail on a fully zoomed-out view.
ImageDimPlot(seurat_clean, fov="ROI1", boundaries="segmentation", border.color = "black" )
Gene expression or other continous variable can be visualised on the new FOV as before.
ImageFeaturePlot(seurat_clean, "MS4A1", fov="ROI1", boundaries="segmentation" , border.color = "black") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat_clean, "SOX9", fov="ROI1", boundaries="segmentation" , border.color = "black") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
ImageFeaturePlot(seurat_clean, "CD163", fov="ROI1", boundaries="segmentation" , border.color = "black") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
We can also overlay the coordinates of individual molecules to the plot.
This visualisation can be useful because molecules are stored independently of cells and cell boundaries in Seurat. Therefore, if there are regions where cell segmentation is not good, or if cells were filtered out from clustering analysis due to their low quality, the molecules will remain and can still be visualised this way.
Alpha of molecules could possibly be changed by adding another ggplot layer (geom_point) that contains the molecule data.
ImageFeaturePlot(seurat_clean, "CD3E", fov="ROI1", boundaries="segmentation", molecules=c("CD8A", "FOXP3"), mols.size = .5, mols.cols = c(adjustcolor("red", alpha.f = 0.2), adjustcolor("blue", alpha.f = 0.2)), border.color = "black") + scale_fill_viridis_c()
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.
Cell Type Identification
You can manually annotate your cell clusters, or you can classify them using a reference single-cell dataset. This process is simpler than for Visium data because our data is at the single-cell level, establishing a one-to-one relationship without the need for spot deconvolution.
However, our transcriptome is more limited here, and some cell types may not be well represented. Additionally, our single-cell reference might be missing some cell types that are not well captured by droplet-based technologies but are present in our tissue data.
In this example, we will use a single-cell reference dataset that we prepared earlier.
We will start by reading in the seurat RDS file.
ref <- readRDS("/project/shared/spatial_data_camp/datasets/SINGLE_CELL_REFERENCES/COLON_HC_5K_CELLS.RDS")
Examine the object:
ref
An object of class Seurat
33556 features across 5725 samples within 3 assays
Active assay: RNA (33538 features, 2000 variable features)
3 layers present: counts, data, scale.data
2 other assays present: HTO, ADT
2 dimensional reductions calculated: pca, umap
And plot the pre-computed cell clusters of the reference data set. We can see that here we have quite high level annotation.
DimPlot(ref, label = T, repel = T)
We want to evaluate how much structural information is lost in single-cell data when limiting ourselves to the targeted gene set. Accurate cluster prediction is challenging if the current gene set does not adequately identify them. To do this, we will quickly re-embedd the data using only the genes present in our spatial transcriptomics data and keep the original cluster annotations derived from unbiased data.
In this example, we can observe that the limited gene set does a reasonably good job at distinguishing major cell populations. However, it struggles to differentiate between similar cell types, such as myofibroblasts and fibroblasts, as effectively as before.
ref <- SCTransform(ref, residual.features =genes_all)
Running SCTransform on assay: RNA
Computing residuals for the 325 specified features
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Calculating cell attributes from input UMI matrix: log_umi
Variance stabilizing transformation of count matrix of size 19008 by 5725
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
Found 80 outliers - those will be ignored in fitting/regularization step
Skip calculation of full residual matrix
Calculating gene attributes
Wall clock passed: Time difference of 25.31267 secs
Determine variable features
Setting min_variance based on median UMI: 0.16
Calculating residuals of type pearson for 319 genes
|
| | 0%
|
|==================================================== | 50%
|
|========================================================================================================| 100%
Centering data matrix
|
| | 0%
|
|========================================================================================================| 100%
Place corrected count matrix in counts slot
Set default assay to SCT
ref <- RunPCA(ref)
PC_ 1
Positive: IGFBP7, SOCS3, APOE, IFITM1, TUBA1A, TIMP3, ANXA1, SELENOM, FRZB, VCAN
DEPP1, TNFAIP3, THBS1, CXCR4, CCL5, IRF8, RARRES2, CLU, GIMAP7, CD83
CD3E, TUBB, GPR183, CCL4, IL7R, PTGER4, ETS1, CD7, RORA, DPYSL3
Negative: GUCA2A, SLC26A3, CEACAM7, CA1, CA2, SLC26A2, CEACAM1, AQP8, CEACAM5, CDHR5
MS4A12, CA4, SDCBP2, GUCA2B, SELENBP1, MYH14, CD24, MUC12, CD177, CLCA4
TMIGD1, HRCT1, CEACAM6, AREG, CES2, C2orf88, GNA11, IL32, SLC6A8, CDKN2B
PC_ 2
Positive: IGFBP7, SOCS3, APOE, GUCA2A, IFITM1, TIMP3, CEACAM7, SLC26A3, TUBA1A, CA4
AQP8, CEACAM1, SELENOM, MS4A12, GUCA2B, VCAN, THBS1, DEPP1, CEACAM5, SLC26A2
IL32, FRZB, CDHR5, TMIGD1, CLCA4, CD177, SDCBP2, HRCT1, CA1, ANXA1
Negative: CXCL3, ADH1C, WFDC2, PPP1R1B, LEFTY1, CCL5, DERL3, CD24, STMN1, UBE2C
CD79A, C1QBP, PTTG1, KRTCAP3, CCL20, AREG, TYMS, SLPI, SEC11C, OLFM4
CD3E, LGALS2, HMGB2, CDCA7, RHOV, CXCR4, SOX9, GPR183, TKT, LCN2
PC_ 3
Positive: CXCL3, ADH1C, APOE, WFDC2, CD24, SOCS3, IGFBP7, RARRES2, PPP1R1B, LEFTY1
SLPI, SELENBP1, KRTCAP3, UBE2C, C1QBP, STMN1, LGALS2, UGT2B17, CA2, LCN2
AREG, THBS1, OLFM4, TIMP3, TYMS, NXPE4, CES2, SOX9, CMBL, RHOV
Negative: CCL5, CD3E, ANXA1, CXCR4, IL7R, IL32, TNFAIP3, CD7, GPR183, KLRB1
CST7, LTB, CD3D, FYB1, CD2, NKG7, LEPROTL1, CCL4, SPOCK2, TRBC2
TRBC1, CD8A, CD83, PTGER4, XCL2, CCR7, RORA, CEACAM7, SELENOK, AQP8
PC_ 4
Positive: DERL3, CD79A, TNFRSF17, FKBP11, CCL4, SEC11C, CD83, CA4, AQP8, MS4A1
VPREB3, CD79B, GUCA2B, CEACAM7, PRDX4, GUCA2A, CEACAM1, C1QA, C1QC, C1QB
DNASE1L3, FCRLA, SELENOM, BANK1, MS4A7, CLCA4, FCER2, LRMP, IL1B, TCL1A
Negative: CXCL3, ANXA1, CD3E, ADH1C, CD24, IL7R, WFDC2, CA1, IL32, CA2
APOE, CD7, SELENBP1, KLRB1, AREG, PPP1R1B, CCL20, TNFAIP3, RARRES2, SLPI
SOCS3, CCL5, LEPROTL1, THBS1, LEFTY1, IFITM1, LGALS2, CD3D, UBE2C, STMN1
PC_ 5
Positive: APOE, CA1, CA2, SLC26A2, SELENBP1, THBS1, SLC26A3, CCL4, VCAN, CD24
DERL3, TNFAIP3, IRF8, MUC12, SOCS3, GPR183, CCL5, CD79A, CDHR5, CES2
LGALS2, C1QA, TNFRSF17, C1QC, CD14, C1QB, AREG, UGT2B17, ADH1C, FKBP11
Negative: CA4, IGFBP7, CLU, RNASE1, GIMAP7, TUBA1A, CA7, BEST4, STMN1, SPIB
GUCA2B, WFDC2, OTOP2, HEPACAM2, GUCA2A, AQP1, AQP8, ANXA1, TIMP3, IL32
CEACAM6, UBE2C, IFITM1, CEACAM1, KLK1, HES6, RETNLB, TUBB, CLCA4, CTSE
ref <- RunUMAP(ref, dims=1:20)
11:08:53 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
11:08:53 Read 5725 rows and found 20 numeric columns
11:08:53 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
11:08:53 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
11:08:54 Writing NN index file to temp file /project/.tmpRsessions/RtmpSfvIpt/file1d0a1f32a8f48a
11:08:54 Searching Annoy index using 1 thread, search_k = 3000
11:08:56 Annoy recall = 100%
11:09:03 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
11:09:13 Initializing from normalized Laplacian + noise (using RSpectra)
11:09:13 Commencing optimization for 500 epochs, with 231718 positive edges
Using method 'umap'
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
11:09:26 Optimization finished
DimPlot(ref, label=T, repel=T)
If we visualise the specificity of the gene panel across our single cell reference clusters, we can see that the panel coverage is mainly concentrated across epithelial cells and T-Cells and other immune cells, with few specific markers expressed by stromal cells.
ps <- AggregateExpression(ref, features = rownames(seurat), normalization.method = "LogNormalize", assays="RNA", return.seurat = T)
|
| | 0%
|
|========= | 9%
|
|=================== | 18%
|
|============================ | 27%
|
|====================================== | 36%
|
|=============================================== | 45%
|
|========================================================= | 55%
|
|================================================================== | 64%
|
|============================================================================ | 73%
|
|===================================================================================== | 82%
|
|=============================================================================================== | 91%
|
|========================================================================================================| 100%
ps <- ScaleData(ps, features=rownames(ps))
Centering and scaling data matrix
|
| | 0%
|
|========================================================================================================| 100%
pheatmap(LayerData(ps, layer="scale.data"), show_rownames = F)
Next, we can use the standard Seurat integration and cross-classification workflow to transfer single-cell derived labels to our spatial object.
Briefly, the first function identifies anchors between the reference single-cell dataset (ref) and the query spatial dataset (seurat). Anchors are pairs of cells that are considered similar between the datasets. The normalization.method = “SCT” specifies that SCTransform normalization should be used.
The second step transfers the cell type labels from the reference dataset to the query dataset. The anchorset argument specifies the anchors found in the previous step. The refdata = ref$CellType argument specifies the cell type labels from the reference dataset to be transferred. The prediction.assay = TRUE argument indicates that the transferred labels should be stored in a new assay in the query dataset. The weight.reduction = seurat[[“pca”]] argument specifies the dimensionality reduction to be used for weighting the transfer, and dims = 1:30 specifies the number of dimensions to use.
anchors <- FindTransferAnchors(reference = ref,
query = seurat_clean,
normalization.method = "SCT")
Normalizing query using reference SCT model
Warning: No layers found matching search pattern providedPerforming PCA on the provided reference using 319 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
Found 5874 anchors
seurat_clean <- TransferData(anchorset = anchors,
refdata = ref$CellType,
prediction.assay = TRUE,
weight.reduction = seurat[["pca"]],
query = seurat_clean,
dims=1:30)
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
Warning: Layer counts isn't present in the assay object; returning NULL
Unfortunately, the predicted labels and spatial clusters do not correspond clearly in all cases. This discrepancy is particularly evident in the middle regions of the UMAP, where many cells are predicted as epithelial cells - probably incorrectly!
How to improve this?
Ensure Good Representation of Cell Type Markers in in situ Target Panel Most critically, before undertaking any experiments you want to ensure that there is good representation of all cell types in your target panel - in this case, there is not much to be done as the data has already been generated.
Review and Refine Reference Data: Ensure that the reference single-cell dataset is comprehensive and accurately annotated. If certain cell types are not well represented or annotated in the reference dataset, it can lead to misclassification.
Increase the Number of Dimensions: Increasing the number of dimensions used in the UMAP and PCA steps might capture more variance in the data, leading to better label transfer.
Filter and Preprocess Data: Filtering out low-quality cells or genes and performing additional preprocessing steps can enhance the accuracy of the transfer anchors and, consequently, the label predictions.
Manually Annotate or Correct Predictions: In cases where automatic label transfer is insufficient, consider manually annotating or correcting the predictions for critical regions to ensure accuracy.
Define a named vector with colors to keep same colors for same cells in UMAP and Spatial Plots.
coul <- brewer.pal(min(length(unique(seurat_clean$predicted.id)), 11), "Paired")
coul <- colorRampPalette(coul)(length(unique(seurat_clean$predicted.id)))
names(coul) <- unique(seurat_clean$predicted.id)
DimPlot(seurat_clean, group.by = "predicted.id", cols = coul)
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
As before, we can also visualise the predicted cell labels in tissue space.
ImageDimPlot(seurat_clean, group.by = "predicted.id", cols = coul)
Warning: No FOV associated with assay 'SCT', using global default FOV
ImageDimPlot(seurat_clean, fov = "ROI1", group.by = "predicted.id", cols = coul, boundaries = "segmentation", border.color = "black")
In order to use the predicted cluster names instead of just the cluster numbers set the identity in the seurat object to the predicted.id and use FindMarkers to extract the markers for that cluster. Check if the top 10 marker genes seem to really point at glia cells. What are glial cells doing in the follicle?
Idents(seurat_clean) <- seurat_clean$predicted.id
Gliamarkers <- FindMarkers(seurat_clean, ident.1 = "Glia", max.cells.per.ident = 500)
Glia_top10 <- head(markers[order(markers$p_val), ], 10)
Glia_top10
CLU (Clusterin): This gene encodes a protein involved in various cellular processes, including apoptosis, cell cycle regulation, and DNA repair. It is often associated with neurodegenerative diseases and cancers.
DPYSL3 (Dihydropyrimidinase Like 3): This gene is involved in nervous system development and axon guidance. It plays a role in neuronal growth cone collapse and cell migration
THBS1 (Thrombospondin 1): This gene encodes a glycoprotein that mediates cell-to-cell and cell-to-matrix interactions. It is involved in processes like angiogenesis, wound healing, and tumorigenesis
CD24: This gene encodes a sialoglycoprotein involved in cell adhesion and signaling. It is expressed on mature granulocytes and B cells and is associated with various cancers
SLC12A2 (Solute Carrier Family 12 Member 2): This gene encodes a protein that mediates sodium and chloride transport. It is important for maintaining ionic balance and cell volume, and is associated with various epithelial cells
PPP1R1B (Protein Phosphatase 1 Regulatory Inhibitor Subunit 1B): Also known as DARPP-32, this gene is involved in dopaminergic and glutamatergic signaling. It is important in neurological and psychiatric disorders
ALDH1B1 (Aldehyde Dehydrogenase 1 Family Member B1): This gene is involved in alcohol metabolism and detoxification. It is expressed in various tissues, including the liver and pancreas
SOX9: This gene encodes a transcription factor crucial for chondrocyte differentiation and skeletal development. It also plays a role in sex determination
IGFBP7 (Insulin-like Growth Factor Binding Protein 7): This gene regulates the availability of insulin-like growth factors (IGFs) and is involved in cell growth and adhesion. It is implicated in various cancers
HMGB2 (High Mobility Group Box 2): This gene encodes a chromatin-associated protein involved in transcription, chromatin remodeling, and DNA repair. It is expressed in various cell types, including those in the immune system
These genes are often associated with glial cells, which are non-neuronal cells in the central nervous system that provide support and protection for neurons. Let me know if you need more detailed information on any of these genes!
Identify areas with low prediction scores.
FeaturePlot(seurat_clean, "predicted.id.score")
Rasterizing points since number of points exceeds 100,000.
To disable this behavior set `raster=FALSE`
To improve these artefacts, alternative cell segmentation algorithms could be used. What works best is very tissue dependant and there’s no easy one stop solution to this. Cell segmentation algorithms can be divided into a few groups.
Nuclei-based Segmentation algorithms primarily focus on identifying cell nuclei, which are usually more distinct and easier to detect than the cell boundaries. Once the nuclei are identified, the cell boundaries are inferred by expanding around the nuclei. This approach works well in tissues where the nuclei are clearly visible and distinct and in early versions of many in situ platforms, were the only available methods due to only using DAPI stain.
Cell Boundary-Based Segmentation algorithms (e.g. Cellpose) directly segments cells by identifying their boundaries. It is particularly effective for images with complex cell shapes and varying sizes, but this required good cell boundary staining - this is not available for our test dataset. Often cell boundary staining can be non-uniform across different tissues, adding further difficulties. Cellpose version 3 incorporates user-guided model training, which can be very useful for difficult to segment cell types - but this requires time investment to annotate training examples.
Transcript-Density Based Segmentation algorithms, like Baysor segments cells based on the spatial distribution of transcripts. It uses Bayesian inference to assign transcripts to cells, considering both the density and distribution of RNA molecules. This can be very useful for improving cell segmentation where cell boundary stain is not available or not working well.
Spatial Neighbourhood Analyis
Spatial neighbourhood analysis identifies cells that are spatially close to each other within a tissue section. This technique helps to understand the spatial organization and potential interactions between cells. The same principles used in Visium data can be applied to in situ data.
GetTissueCoordinates: Retrieves the spatial coordinates of the centroids from the Seurat object. which = “centroids”: Specifies that the centroids’ coordinates should be retrieved. rownames(coords) <- coords$cell: Sets the row names of the coords data frame to the cell IDs.
FindNeighbors: Identifies the nearest neighbours for each cell based on their spatial coordinates. as.matrix(coords[, c(“x”, “y”)]): Converts the x and y coordinates to a matrix format. k.param = 20: Specifies the number of nearest neighbours to identify for each cell. return.neighbor = TRUE: Ensures that the function returns the neighbour indices and distances.
TIP: This approach identifies spatial neighbours. If you analysis requires precise identification of directly adjacent or interacting cell neighbours, then a delaunay network based approach would be more appropriate. R package Giotto implements some nice functionalities based on this
coords <- GetTissueCoordinates(seurat, which = "centroids")
rownames(coords) <- coords$cell
neighbours <- FindNeighbors(as.matrix(coords[, c("x", "y")]), k.param = 20, return.neighbor=TRUE)